\documentclass[10pt]{article}

\usepackage[letterpaper,margin=1.0in]{geometry}
\usepackage{array}
\usepackage{fancyvrb}
\usepackage{indentfirst}
\usepackage[hidelinks]{hyperref}
\usepackage{longtable}
\usepackage{multirow}
\usepackage{multicol}
\usepackage{secdot}
\usepackage{sectsty}
\usepackage{times}
\usepackage{verbatimbox}
\usepackage[dvipsnames]{xcolor}
\sectionfont{\fontsize{12pt}{12pt}\selectfont}


\begin{document}
\title{Kite: Architecture Simulator for RISC-V Instruction Set}
\author{William J. Song\\
        School of Electrical Engineering, Yonsei University\\
        \href{mailto:wjhsong@yonsei.ac.kr}{wjhsong@yonsei.ac.kr}}
\date{April 2023 (Version 1.11)}
\maketitle


\section{Introduction} \label{sec:introduction}
\emph{Kite} is an architecture simulator that models a five-stage pipeline of RISC-V instruction set \cite{waterman_riscv2019}.
The initial version of Kite was developed in 2019 for educational purposes as a part of EEE3530 Computer Architecture.
Kite implements the five-stage pipeline model described in \emph{Computer Organization and Design, RISC-V Edition: The Hardware and Software Interface} by D. Patterson and J. Hennessey \cite{patterson_morgan2017}.
The objective of Kite is to provide students with an easy-to-use simulation framework and the precise timing model described in the book.
It supports most of the basic instructions introduced in the book, such as {\tt add}, {\tt slli}, {\tt ld}, {\tt sd}, and {\tt beq} instructions.
Users can easily compose RISC-V assembly programs and execute them through the pipeline model for entry-level architecture studies.
Kite implements several basic functionalities, including instruction dependency checks (i.e., data hazards), pipeline stalls, data forwarding (optional), branch predictions (optional), data cache structures, etc.
The remainder of this document describes the Kite implementation and its usage.


\section{Prerequisite, Download, and Installation} \label{sec:install}
Kite implements the five-stage pipeline model in C++.
Since most computer architecture simulators are written in C/C++, it gives students hands-on experiences in computer architecture simulations using a simple, easy-to-use framework.
The simple implementation of Kite is easy to install.
It requires only a g++ compiler to build and does not depend on other libraries or external tools.
It was validated in Ubuntu 16.04 (Xenial), 18.04 (Bionic Beaver), 20.04 (Focal Fossa), and Mac OS 10.14 (Majave), 10.15 (Catalina), 11 (Big Sur), 12 (Monterey).
The latest release of Kite is v1.11.
To obtain a copy of Kite v1.11, use the following {\tt git} command in the terminal.
Then, enter the {\tt kite/} directory and build the simulator using a {\tt make} command.

\begin{Verbatim}[frame=single]
$ git clone --branch v1.11 https://github.com/yonsei-icsl/kite
$ cd kite/
$ make -j
\end{Verbatim}


\section{Program Code, Register and Memory States} \label{sec:inputs}
Kite takes three input files to run, i) program code, ii) register state, and iii) data memory state.
A program code is a text file containing RISC-V assembly instructions.
In the main directory of Kite (i.e., {\tt kite/}), you can find an example code named {\tt program\char`_code} that has the following RISC-V instructions.

\begin{Verbatim}[frame=single]
# Kite program code
loop:   beq  x11, x0,  exit
        remu x5,  x10, x11
        add  x10, x11, x0
        add  x11, x5,  x0
        beq  x0,  x0,  loop
exit:   sd   x10, 2000(x0)
\end{Verbatim}

A {\tt \#} symbol is reserved for comment in the program code.
Everything after the {\tt \#} sign until the end of the line is regarded as a comment and not read.
Each instruction in the program code is sequentially stored in the memory, starting from PC = 4.
For example, {\tt beq} is the first instruction in the example code above, given the PC value of 4.
The next instruction, {\tt remu}, has PC = 8 because every RISC-V instruction is 4 bytes long \cite{waterman_riscv2019}.
The code size is 24 bytes for six instructions, spanning the lowest-address memory region from 4 to 28.
Data access to the code segment is forbidden and will throw an error.

Instructions in the program code are sequentially executed from the top unless a branch or jump instruction alters the next PC.
The program ends when the next PC naturally goes out of the code segment where there exists no valid instruction or when the next PC is set to zero (i.e., PC = 0).
The PC value of zero is reserved as an invalid address.
Labels, instructions, and register operands are case-insensitive.
It means that {\tt beq}, {\tt Beq}, {\tt BEQ} are all treated identically.
The current version of Kite supports the following RISC-V instructions.
Instructions in the table are listed in alphabetical order in each category.
Pseudo instructions (e.g., {\tt mv}, {\tt not}) and ABI register names (e.g., {\tt sp}, {\tt a0}) are not supported.

\begin{longtable}{>{\centering\arraybackslash} m{0.60in}|
                  >{\centering\arraybackslash} m{1.65in}|
                  >{\centering\arraybackslash} m{3.70in}
                 }
\hline
Types                   & RISC-V instructions       & Operations in C/C++   \\ \hline \hline
\multirow{13}{*}{R type}& {\tt add  rd, rs1, rs2}   & {\tt rd = rs1 + rs2}  \\
                        & {\tt and  rd, rs1, rs2}   & {\tt rd = rs1 \& rs2} \\
                        & {\tt div  rd, rs1, rs2}   & {\tt rd = rs1 / rs2}  \\
                        & {\tt divu rd, rs1, rs2}   & {\tt rd = (uint64\char`_t)rs1 / (uint64\char`_t)rs2} \\
                        & {\tt mul  rd, rs1, rs2}   & {\tt rd = rs1 * rs2}  \\
                        & {\tt or   rd, rs1, rs2}   & {\tt rd = rs1 | rs2}  \\
                        & {\tt rem  rd, rs1, rs2}   & {\tt rd = rs1 \% rs2} \\
                        & {\tt remu rd, rs1, rs2}   & {\tt rd = (uint64\char`_t)rs1 \% (uint64\char`_t)rs2} \\
                        & {\tt sll  rd, rs1, rs2}   & {\tt rd = rs1 << rs2} \\
                        & {\tt sra  rd, rs1, rs2}   & {\tt rd = rs1 >> rs2} \\
                        & {\tt srl  rd, rs1, rs2}   & {\tt rd = (uint64\char`_t)rs1 >> rs2} \\
                        & {\tt sub  rd, rs1, rs2}   & {\tt rd = rs1 - rs2}  \\
                        & {\tt xor  rd, rs1, rs2}   & {\tt rd = rs1 \string^ rs2} \\ \hline
\multirow{9}{*}{I type} & {\tt addi rd, rs1, imm}   & {\tt rd = rs1 + imm}  \\
                        & {\tt andi rd, rs1, imm}   & {\tt rd = rs1 \& imm} \\
                        & {\tt jalr rd, imm(rs1)}   & {\tt rd = pc + 4, pc = (rs1 + imm) \& -2} \\
                        & {\tt ld   rd, imm(rs1)}   & {\tt rd = memory[rs1 + imm]} \\
                        & {\tt slli rd, rs1, imm}   & {\tt rd = rs1 << imm} \\
                        & {\tt srai rd, rs1, imm}   & {\tt rd = rs1 >> imm} \\
                        & {\tt srli rd, rs1, imm}   & {\tt rd = (uint64\char`_t)rs1 << imm} \\
                        & {\tt ori  rd, rs1, imm}   & {\tt rd = rs1 | imm}  \\
                        & {\tt xori rd, rs1, imm}   & {\tt rd = rs1 \string^ imm} \\ \hline
S type                  & {\tt sd   rs2, imm(rs1)}  & {\tt memory[rs1 + imm] = rs2} \\ \hline
\multirow{4}{*}{SB type}& {\tt beq  rs1, rs2, imm}  & {\tt pc = (rs1 == rs2 ? pc + imm<<1 : pc + 4) } \\
                        & {\tt bge  rs1, rs2, imm}  & {\tt pc = (rs1 >= rs2 ? pc + imm<<1 : pc + 4)} \\
                        & {\tt blt  rs1, rs2, imm}  & {\tt pc = (rs1 < rs2 ? pc + imm<<1 : pc + 4)} \\
                        & {\tt bne  rs1, rs2, imm}  & {\tt pc = (rs1 != rs2 ? pc + imm<<1 : pc + 4)} \\ \hline
U type                  & {\tt lui  rd, imm}        & {\tt rd = imm << 12; } \\ \hline
UJ type                 & {\tt jal  rd, imm}        & {\tt rd = pc + 4, pc = pc + imm<<1} \\ \hline
No type                 & {\tt nop}                 & No operation          \\ \hline
\end{longtable}

The program code shown in the previous page implements Euclid's algorithm that finds the greatest common divisor (GCD) of two unsigned integer numbers in {\tt x10} and {\tt x11}.
The equivalent program code in C/C++ is shown below.
For instance, suppose that {\tt x10 = 21} and {\tt x11 = 15}.
In the first iteration, the remainder of {\tt x10 / x11} (i.e., modulo operation) is stored in the {\tt x5} register, which is {\tt x5 = 6}.
Then, {\tt x11} and {\tt x5} are copied to {\tt x10} and {\tt x11}, respectively.
The second iteration makes {\tt x5 = 3} by taking the remainder of {\tt 15 / 6}, and {\tt x10 = 6} and {\tt x11 = 3}.
In the third iteration, {\tt x5 = 0}, {\tt x10 = 3}, and {\tt x11 = 0}.
The loop reaches the end when {\tt x11 == 0}.
The last code line stores the {\tt x10} value in the data memory at address {\tt 2000} (or at index {\tt 250} if the memory is viewed as a doubleword array).

\begin{Verbatim}[frame=single]
/* Equivalent program code in C/C++ */
while(x11 != 0) {       // loop: beq  x11, x0,  exit
    x5 = x10 % x11;     //       remu x5,  x10, x11
    x10 = x11;          //       add  x10, x11, x0
    x11 = x5;           //       add  x11, x5,  x0
}                       //       beq  x0,  x0,  loop
memory[250] = x10;      // exit: sd   x10, 2000(x0)
\end{Verbatim}

Register state refers to the {\tt reg\char`_state} file that contains the initial values of 32 integer registers.
The register state is loaded when a simulation starts.
The following shows a part of the register state file after the comment lines.

\begin{Verbatim}[frame=single]
# Kite register state
x0 = 0
x1 = 0
x2 = 4096
x3 = 0

...

x10 = 21
x11 = 15

...

x30 = 0
x31 = 0
\end{Verbatim}

The register state file must list all 32 integer registers from {\tt x0} to {\tt x31}.
The left of an equal sign is a register name (e.g., {\tt x10}), and the right side defines its initial value.
The example above shows that {\tt x10} and {\tt x11} registers are set to 21 and 15, respectively.
When a simulation starts, the registers are initialized accordingly.
A program code executes the instructions based on the defined register state.
Since the {\tt x0} register in RISC-V is hard-wired to zero \cite{patterson_morgan2017}, assigning non-zero values to it is automatically discarded.

Memory state refers to the {\tt memory\char`_state} file that contains the initial values of the data memory.
Every address in the memory state file must be a multiple of 8 to align with doubleword data.
The following shows an example of the memory state file.

\begin{Verbatim}[frame=single]
# Kite memory state
0 = -31
8 = -131
16 = -134
24 = -421
32 = 59

...

\end{Verbatim}

Kite creates 4KB data memory by default, and it is modifiable in {\tt proc\char`_t::init()} in {\tt proc.cc}.
Each memory block is 8 bytes long, and accesses to the data memory must obey the 8-byte alignment.
The 8-byte alignment rule precludes non-doubleword memory instructions such as {\tt lhu} and {\tt sb}.
This restriction will be lifted in future releases.
In each line of the memory state, the left of an equal sign is a memory address in multiple of 8, and the right side defines a 64-bit value stored at the memory address.
The memory state file does not have to list all memory addresses.
The values of unlisted memory addresses will be set to zero by default.

The minimum size of the data memory is 4KB, and the lowest-address region is used as a code segment.
The size of the code segment is determined by the program code.
For instance, an example in the {\tt\small program\char`_code} file has six instructions.
Since every RISC-V instruction is 4 bytes long, the program code is 24 bytes in size.
The program code is assumed to span the lowest memory addresses from 4 to 28, and this region is prohibited from data access.


\section{Running Kite Simulations} \label{sec:running}
A Kite simulation runs a program code based on the register and memory states.
After the program execution, Kite prints out the simulation results, including pipeline statistics (e.g., total clock cycles, number of instructions), cache statistics (e.g., number of loads and stores), and the final state of the register file and data memory.
The following shows the results of the Kite simulation executing {\tt program\char`_code}.

\begin{Verbatim}[frame=single]
$ ./kite program_code

***********************************************************
* Kite: Architecture Simulator for RISC-V Instruction Set *
* Developed by William J. Song                            *
* Intelligent Computing Systems Lab, Yonsei University    *
* Version: 1.11                                           *
***********************************************************

Start running ...
Done.

======== [Kite Pipeline Stats] =========
Total number of clock cycles = 48
Total number of stalled cycles = 6
Total number of executed instructions = 17
Cycles per instruction = 2.824

Data cache stats:
    Number of loads = 0
    Number of stores = 1
    Number of writebacks = 0
    Miss rate = 1.000 (1/1)

Register state:
x0 = 0
x1 = 0
x2 = 4096
x3 = 0
x4 = 0
x5 = 0
x6 = 0
x7 = 0
x8 = 0
x9 = 0
x10 = 3
x11 = 0
x12 = 0
x13 = 0
x14 = 0
x15 = 0
x16 = 0
x17 = 0
x18 = 0
x19 = 0
x20 = 32
x21 = 400
x22 = 720
x23 = 0
x24 = 0
x25 = 0
x26 = 0
x27 = 0
x28 = 0
x29 = 0
x30 = 0
x31 = 0

Memory state (all accessed addresses):
(2000) = 3
\end{Verbatim}

The default pipeline model does not make branch prediction and data forwarding.
The pipeline is stalled on every conditional branch instruction until its outcome is resolved.
The lack of data forwarding also stalls the pipeline when an instruction depends on its preceding instructions.
The pipeline stall is lifted as soon as the preceding instructions write their results to the registers that the stalled instruction depends on.

Kite offers branch prediction and data forwarding as optional features.
These options can be enabled by re-compiling Kite with {\tt OPT="-DBR\char`_PRED} {\tt -DDATA\char`_FWD"} flags, where {\tt -DBR\char`_PRED} enables branch prediction, and {\tt -DDATA\char`_FWD} enables data forwarding, respectively.
When both features are enabled, noticeable performance improvements (i.e., cycles per instruction) can be observed as follows.

\begin{Verbatim}[frame=single]
$ make clean
$ make -j OPT="-DBR_PRED -DDATA_FWD"
$ ./kite program_code

***********************************************************
* Kite: Architecture Simulator for RISC-V Instruction Set *
* Developed by William J. Song                            *
* Intelligent Computing Systems Lab, Yonsei University    *
* Version: 1.11                                           *
***********************************************************

Start running ...
Done.

======== [Kite Pipeline Stats] =========
Total number of clock cycles = 36
Total number of stalled cycles = 3
Total number of executed instructions = 17
Cycles per instruction = 2.118
Number of pipeline flushes = 4
Number of branch mispredictions = 4
Number of branch target mispredictions = 0
Branch prediction accuracy = 0.429 (3/7)

Data cache stats:
    Number of loads = 0
    Number of stores = 1
    Number of writebacks = 0
    Miss rate = 1.000 (1/1)

Register state:
x0 = 0
x1 = 0
x2 = 4096
x3 = 0

...

\end{Verbatim}

Kite provides a debugging option that prints out the detailed progress of instruction executions at every pipeline stage and every clock cycle.
To enable the debugging option, Kite has to be compiled with a {\tt -DDEBUG} flag as follows.

\begin{Verbatim}[frame=single]
$ make clean
$ make -j OPT="-DDEBUG" 
$ ./kite program_code

***********************************************************
* Kite: Architecture Simulator for RISC-V Instruction Set *
* Developed by William J. Song                            *
* Intelligent Computing Systems Lab, Yonsei University    *
* Version: 1.11                                           *
***********************************************************

Start running ...
1 : fetch : [pc=4] beq x11, x0, 10(exit) [beq 0, 0, 10(exit)]
2 : decode : [pc=4] beq x11, x0, 10(exit) [beq 15, 0, 10(exit)]
3 : execution : [pc=4] beq x11, x0, 10(exit) [beq 15, 0, 10(exit)]
4 : memory : [pc=4] beq x11, x0, 10(exit) [beq 15, 0, 10(exit)]
5 : writeback : [pc=4] beq x11, x0, 10(exit) [beq 15, 0, 10(exit)]
5 : fetch : [pc=8] remu x5, x10, x11 [remu 0, 0, 0]
6 : decode : [pc=8] remu x5, x10, x11 [remu 0, 21, 15]
6 : fetch : [pc=12] add x10, x11, x0 [add 0, 0, 0]
7 : alu : [pc=8] remu x5, x10, x11 [remu 6, 21, 15]
7 : decode : [pc=12] add x10, x11, x0 [add 0, 15, 0]
7 : fetch : [pc=16] add x11, x5, x0 [add 0, 0, 0]
8 : execution : [pc=8] remu x5, x10, x11 [remu 6, 21, 15]
8 : decode : [pc=12] add x10, x11, x0 [add 0, 15, 0]
8 : fetch : [pc=16] add x11, x5, x0 [add 0, 0, 0]
9 : memory : [pc=8] remu x5, x10, x11 [remu 6, 21, 15]
9 : execution : [pc=12] add x10, x11, x0 [add 15, 15, 0]
9 : fetch : [pc=16] add x11, x5, x0 [add 0, 0, 0]
10 : writeback : [pc=8] remu x5, x10, x11 [remu 6, 21, 15]
10 : memory : [pc=12] add x10, x11, x0 [add 15, 15, 0]
10 : decode : [pc=16] add x11, x5, x0 [add 0, 6, 0]
10 : fetch : [pc=20] beq x0, x0, -8(loop) [beq 0, 0, -8(loop)]

...
\end{Verbatim}

The example above shows that turning on the debugging option discloses the details of instruction executions in the pipeline.
For instance, the first instruction of the program code, {\tt beq}, is fetched at clock cycle \#1.
The debug message shows {\tt 1} {\tt :} {\tt fetch} {\tt :} {\tt[pc = 4] beq x11, x0, 10(exit) [beq 0, 0, 10(exit)]}.
The leading {\tt 1} indicates clock cycle \#1, and {\tt fetch} means that the instruction is currently in the fetch stage.
{\tt pc=4} shows the PC of the {\tt beq} instruction.
The remainder of the debug message shows the instruction format.
The last argument of the {\tt beq} instruction is an immediate value, which is the distance to the branch target in the unit of two bytes.
The {\tt beq} instruction in {\tt program\char`_code} branches to label {\tt exit}, which is translated to immediate value 10.
The repeated instruction format inside the squared brackets shows the actual values of register operands.
Since the registers are not yet read in the fetch stage, the values of {\tt x11} and {\tt x0} registers are simply shown as zeros.
The correct values of source operand registers appear when the instruction advances to the decode stage at cycle \#2.
The value of a destination operand register (if any) becomes available when an instruction reaches the execute stage.
As explained, turning on the debugging option shows the detailed progress of instruction executions in the pipeline.


\section{Implementation} \label{sec:implementation}
This section explains more advanced content about the Kite implementation.
Kite consists of the following files in the {\tt kite/} directory.

\begin{Verbatim}[frame=single]
$ ls | grep ".h\|.cc"

alu.cc           data_cache.cc   defs.h          inst_memory.h  proc.cc
alu.h            data_cache.h    inst.cc         main.cc        proc.h
br_predictor.cc  data_memory.cc  inst.h          pipe_reg.cc    reg_file.cc
br_predictor.h   data_memory.h   inst_memory.cc  pipe_reg.h     reg_file.h
\end{Verbatim}

\begin{itemize}
\leftskip-0.20in
\item
    {\tt main.cc} has the {\tt main()} function that creates, initializes, and runs Kite.
\item
    {\tt defs.h} defines the {\tt enum} list of RISC-V instructions (e.g., {\tt op\char`_add}, {\tt op\char`_ld)}, instruction types (e.g., {\tt op\char`_r\char`_\linebreak type}, {\tt op\char`_i\char`_type}), integer registers (e.g., {\tt reg\char`_x0}, {\tt reg\char`_x1}), and ALU latencies of all supported instructions.
    Kite supports multi-cycle ALU executions, and {\tt defs.h} defines that {\tt div}, {\tt divu}, {\tt mul}, {\tt remu}, {\tt remu}, {\tt fdiv.d}, and {\tt fmul.d} instructions take two cycles for an ALU to execute.
    All other instructions take one clock cycle.
    {\tt defs.h} provides several macros near the end of the file to convert strings to {\tt enum} lists (e.g., {\tt get\char`_op\char`_opcode()}, {\tt get\char`_op\char`_type()}), and vice versa.
\item
    {\tt inst.h/cc} files define a C++ class that represents a RISC-V instruction.
    RISC-V instructions in a program code (e.g., {\tt program\char`_code}) are read when a simulation starts.
    Each assembly instruction in the character string is converted into the {\tt inst\char`_t} class that carries the instruction information, including the program counter (i.e., {\tt pc}), register numbers (e.g., {\tt rs1\char`_num}), register values (e.g., {\tt rs1\char`_val}), immediate value (i.e., {\tt imm}), memory addresses (i.e., {\tt memory\char`_addr}), control directives (e.g., {\tt alu\char`_latency}, {\tt rd\char`_ready}), etc.
\item
    {\tt inst\char`_memory.h/cc} files implement the instruction memory.
    It supplies instructions to the processor pipeline, directed by the program counter (PC).
    In particular, the instruction memory creates a copy of an instruction and returns a pointer to it when fetching it.
    The processor pipeline works on the instruction pointer instead of a copy since passing the pointer is much faster than passing the sizable {\tt inst\char`_t} class.
    When a simulation starts, the instruction memory loads a program code that contains RISC-V assembly instructions.
    It parses the program code and stores the instructions using the {\tt inst\char`_t} class.
    The instruction memory provides the processor pipeline with instructions until the PC goes out of a code segment where there exists no valid instruction.
    PC = 0 is reserved as invalid, which makes the pipeline stop fetching instructions.
    The size of the code segment is determined by the program code.
\item
    {\tt pipe\char`_reg.h/cc} files define the {\tt pipe\char`_reg\char`_t} class that models a pipeline register connecting two neighboring stages such as IF/ID.
    The class has the following four functions.
    The {\tt read()} function probes the pipeline register and returns a pointer to an instruction if it holds one.
    This function does not remove the instruction from the pipeline register.
    Removing the instruction from the pipeline register requires an explicit invocation of the {\tt clear()} function.
    The {\tt write()} function writes an instruction to the pipeline register, and {\tt is\char`_available()} tests if the pipeline register has no instruction inside and thus is free.
\item{\tt reg\char`_file.h/cc} files implement a register file, {\tt reg\char`_file\char`_t}.
    The register file contains 32 64-bit integer registers (i.e., {\tt x0}-{\tt x31}).
    When a simulation starts, the register file loads a state file (i.e., {\tt reg\char`_state}) that defines the initial values of the registers.
    Notably, dependency-checking logic is embedded in the register file instead of probing pipeline registers to detect data hazards.
    Checking data dependency is simply done by maintaining a table that traces which instruction is the last producer of each register.
    When a branch misprediction occurs, the table is flushed to discard the information of wrong-path instructions.
\item
    {\tt alu.h/cc} files model an arithmetic-logical unit (ALU) that executes instructions based on their operation types.
    To support type-aware operations, the operand of an unsigned instruction such as {\tt divu} is cast to {\tt uint64\char`_t} to produce the correct result.
    It is assumed that instructions cannot be pipelined inside an ALU since doing so can potentially cause out-of-order executions.
    If the ALU gets a multi-cycle instruction such as {\tt div} or {\tt mul}, all subsequent instructions stall until the ALU becomes free.
\item
    {\tt data\char`_memory.h/cc} files implement the data memory that holds data values.
    When a simulation starts, the data memory loads a state file (i.e., {\tt memory\char`_state}) to set memory addresses to store the defined values.
    To align with 64-bit registers, the data memory requires that the memory address of a load or store instruction is a multiple of 8.
    The default size of the data memory is 4KB, which is modifiable in {\tt data\char`_memory\char`_t()}.
    Memory addresses belonging to the code segment are inaccessible because the code segment is prohibited from data access.
\item
    {\tt data\char`_cache.h/cc} files implement a data cache that stores a subset of memory blocks.
    A cache block can be any multiple of 8 bytes (e.g., 16-byte cache block).
    The default cache model implements a 1KB direct-mapped cache with 8-byte cache blocks.
    Other cache configurations, such as set-associative caches, can be easily extended from the default cache model, but they are not included in the baseline code for students to work as programming assignments.
\item
    {\tt br\char`_predictor.h/cc} files are used when the optional branch prediction is enabled.
    They define two classes, {\tt br\char`_predictor\char`_t} and {\tt br\char`_target\char`_buffer\char`_t}, representing a branch predictor and branch target buffer (BTB), respectively.
    The member functions of branch predictor and BTB are left nearly empty, again for students to work as programming assignments.
    The default branch predictor makes always-not-taken branch predictions, so the BTB is not technically used.
\item
    {\tt proc.h/cc} files implement a five-stage processor pipeline.
    The processor is defined as the {\tt proc\char`_t} class, which contains datapath components including the instruction memory ({\tt inst\char`_memory\char`_t}), register file ({\tt reg\char`_file\char`_t}), ALU ({\tt alu\char`_t}), data cache ({\tt data\char`_cache\char`_t}), data memory ({\tt data\char`_memory\char`_t}), branch predictor ({\tt br\char`_\linebreak predictor\char`_t}), BTB ({\tt br\char`_target\char`_buffer\char`_t}), and four pipeline registers ({\tt pipe\char`_reg\char`_t}) of IF/ID, ID/EX, EX/MEM, and MEM/ WB.
    The {\tt run()} function of {\tt proc\char`_t} executes the five-stage pipeline in a while-loop, which repeats as long as there are in-flight instructions in the pipeline.
    Note that the pipeline stages are executed backward from the writeback to fetch stages, not from the fetch to writeback stages.
    Each stage function (e.g., {\tt fetch()}, {\tt decode()}) performs necessary operations that need to take place in the corresponding stage.
    At the end of a program execution, {\tt proc\char`_t} prints out the summary of execution results, including the total number of clock cycles, stalled cycles, and executed instructions, followed by cycles per instruction, number of pipeline flushes and branch prediction accuracy (if branch prediction is enabled), and data cache statistics.
    The final state of the register file and data memory is also displayed.
    Only the accessed memory addresses and their final values are printed for the data memory; other untouched addresses are not shown.
\end{itemize}


\section{Contact} \label{sec:contact}
If you notice a bug or have a question regarding Kite, please contact Prof. William Song by email at \href{mailto:wjhsong@yonsei.ac.kr}{wjhsong@yonsei.ac.kr}.


\begin{thebibliography}{2}
\bibitem{waterman_riscv2019}
    A. Waterman and K. Asanovic, 
    ``The RISC-V Instruction Set Manual, Volume I: Unprivileged ISA,''
    \emph{SiFive Inc.} and \emph{University of California, Berkeley},
    June, 2019.
\bibitem{patterson_morgan2017}
    D. Patterson and J. Hennessey,
    ``Computer Organization and Design RISC-V Edition: The Hardware Software Interface,''
    \emph{Morgan Kaufmann}, 1st Ed.,
    Apr. 2017.
\end{thebibliography}

\end{document}
